{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "jieba 的分词算法\n",
    "主要有以下三种：\n",
    "\n",
    "基于统计词典，构造前缀词典，基于前缀词典对句子进行切分，得到所有切分可能，根据切分位置，构造一个有向无环图（DAG）；\n",
    "基于DAG图，采用动态规划计算最大概率路径（最有可能的分词结果），根据最大概率路径分词；\n",
    "对于新词(词库中没有的词），采用有汉字成词能力的 HMM 模型进行切分。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache /var/folders/yw/k7z_d_3567g16ss9plk47x9w0000gn/T/jieba.cache\n",
      "Loading model cost 0.715 seconds.\n",
      "Prefix dict has been built succesfully.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "现如今/，/机器/学习/和/深度/学习/带动/人工智能/飞速/的/发展/，/并/在/图片/处理/、/语音/识别/领域/取得/巨大成功/。\n"
     ]
    }
   ],
   "source": [
    "import jieba\n",
    "\n",
    "content = \"现如今，机器学习和深度学习带动人工智能飞速的发展，并在图片处理、语音识别领域取得巨大成功。\"\n",
    "# 精确分词\n",
    "# 精确分词：精确模式试图将句子最精确地切开，精确分词也是默认分词。\n",
    "segs_1 = jieba.cut(content, cut_all=False)\n",
    "print(\"/\".join(segs_1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 全模式\n",
    "\n",
    "# 全模式分词：把句子中所有的可能是词语的都扫描出来，速度非常快，但不能解决歧义。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "现如今/如今///机器/学习/和/深度/学习/带动/动人/人工/人工智能/智能/飞速/的/发展///并/在/图片/处理///语音/识别/领域/取得/巨大/巨大成功/大成/成功//\n"
     ]
    }
   ],
   "source": [
    "segs_3 = jieba.cut(content, cut_all=True)\n",
    "print(\"/\".join(segs_3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 搜索引擎模式\n",
    "\n",
    "# 搜索引擎模式：在精确模式的基础上，对长词再次切分，提高召回率，适合用于搜索引擎分词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "如今/现如今/，/机器/学习/和/深度/学习/带动/人工/智能/人工智能/飞速/的/发展/，/并/在/图片/处理/、/语音/识别/领域/取得/巨大/大成/成功/巨大成功/。\n"
     ]
    }
   ],
   "source": [
    "segs_4 = jieba.cut_for_search(content)\n",
    "print(\"/\".join(segs_4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['现如今', '，', '机器', '学习', '和', '深度', '学习', '带动', '人工智能', '飞速', '的', '发展', '，', '并', '在', '图片', '处理', '、', '语音', '识别', '领域', '取得', '巨大成功', '。']\n"
     ]
    }
   ],
   "source": [
    "segs_5 = jieba.lcut(content)\n",
    "print(segs_5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# jieba 可以很方便地获取中文词性，通过 jieba.posseg 模块实现词性标注。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('现如今', 't'), ('，', 'x'), ('机器', 'n'), ('学习', 'v'), ('和', 'c'), ('深度', 'ns'), ('学习', 'v'), ('带动', 'v'), ('人工智能', 'n'), ('飞速', 'n'), ('的', 'uj'), ('发展', 'vn'), ('，', 'x'), ('并', 'c'), ('在', 'p'), ('图片', 'n'), ('处理', 'v'), ('、', 'x'), ('语音', 'n'), ('识别', 'v'), ('领域', 'n'), ('取得', 'v'), ('巨大成功', 'nr'), ('。', 'x')]\n"
     ]
    }
   ],
   "source": [
    "import jieba.posseg as psg\n",
    "print([(x.word,x.flag) for x in psg.lcut(content)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 并行分词\n",
    "\n",
    "# 并行分词原理为文本按行分隔后，分配到多个 Python 进程并行分词，最后归并结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "jieba.enable_parallel(4) # 开启并行分词模式，参数为并行进程数 。\n",
    "jieba.disable_parallel() # 关闭并行分词模式 。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 获取分词结果中词列表的 top n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('，', 2), ('学习', 2), ('现如今', 1), ('机器', 1), ('和', 1)]\n"
     ]
    }
   ],
   "source": [
    "from collections import Counter\n",
    "top5= Counter(segs_5).most_common(5)\n",
    "print(top5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 自定义添加词和字典\n",
    "\n",
    "# 默认情况下，使用默认分词，是识别不出这句话中的“铁甲网”这个新词，这里使用用户字典提高分词准确性。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['铁甲', '网是', '中国', '最大', '的', '工程机械', '交易平台', '。']\n"
     ]
    }
   ],
   "source": [
    "txt = \"铁甲网是中国最大的工程机械交易平台。\"\n",
    "print(jieba.lcut(txt))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['铁甲网', '是', '中国', '最大', '的', '工程机械', '交易平台', '。']\n"
     ]
    }
   ],
   "source": [
    "# 手动添加词典\n",
    "jieba.add_word(\"铁甲网\")\n",
    "print(jieba.lcut(txt))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['铁甲网', '是', '中国', '最大', '的', '工程机械', '交易平台', '。']\n"
     ]
    }
   ],
   "source": [
    "jieba.load_userdict('user_dict.txt')\n",
    "print(jieba.lcut(txt))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[现如今/t, ，/w, 机器学习/gi, 和/cc, 深度/n, 学习/v, 带动/v, 人工智能/n, 飞速/d, 的/ude1, 发展/vn, ，/w, 并/cc, 在/p, 图片/n, 处理/vn, 、/w, 语音/n, 识别/vn, 领域/n, 取得/v, 巨大/a, 成功/a, 。/w]\n"
     ]
    }
   ],
   "source": [
    "from pyhanlp import *\n",
    "content = \"现如今，机器学习和深度学习带动人工智能飞速的发展，并在图片处理、语音识别领域取得巨大成功。\"\n",
    "print(HanLP.segment(content))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[铁甲/n, 网/n, 是/vshi, 中国/ns, 最大/gm, 的/ude1, 工程机械/nz, 交易平台/nz, 。/w]\n"
     ]
    }
   ],
   "source": [
    "txt = \"铁甲网是中国最大的工程机械交易平台。\"\n",
    "print(HanLP.segment(txt))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[铁甲网/nz, 是/vshi, 中国/ns, 最大/gm, 的/ude1, 工程机械/nz, 交易平台/nz, 。/w]\n"
     ]
    }
   ],
   "source": [
    "CustomDictionary.add(\"铁甲网\")\n",
    "CustomDictionary.insert(\"工程机械\", \"nz 1024\")\n",
    "CustomDictionary.add(\"交易平台\", \"nz 1024 n 1\")\n",
    "print(HanLP.segment(txt))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
